Abstract

Finitarily Markovian processes are those processes $\{X_n\}_{n=-\infty}^{\infty}$ for
which there is a finite $K$ ($K = K(\{X_n\}_{n=-\infty}^{0})$) such that the
conditional distribution of $X_1$ given the entire past is equal to the
conditional distribution of $X_1$ given only $\{X_n\}_{n=1-K}^{0}$. The least such
value of $K$ is called the memory length. We give a rather complete analysis
of the problems of universally estimating the least such value of $K$,
both in the backward sense that we have just described and in the
forward sense, where one observes successive values of $\{X_n\}$ for $n \ge 0$
and asks for the least value $K$ such that the conditional distribution of
$X_{n+1}$ given $\{X_i\}_{i=n-K+1}^{n}$ is the same as the conditional distribution
of $X_{n+1}$ given $\{X_i\}_{i=-\infty}^{n}$. We allow for finite or countably infinite
alphabet size.




1 Introduction 



An important class of stationary ergodic processes that greatly extends
the finite order Markov chains is the finitarily Markovian class. Informally,
these are those processes $\{X_n\}_{n=-\infty}^{\infty}$ for which there is a finite $K$
(that depends on the past $\{X_n\}$, $n \le 0$) such that the conditional
distribution of $X_1$ given the entire past is equal to the conditional
distribution of $X_1$ given only $\{X_n\}$, $-K < n \le 0$. When the process is a
Markov chain of order $L$ then one can simply take $K = L$ independent of the
values that the process takes. However, even for such Markov chains, quite
often a smaller value may exist for certain realizations of the process. Our
main goal here is to give a rather complete analysis of the problems of
universally estimating the least such value of $K$, both in the backward
sense that we have just described and in the forward sense, where one
observes successive values of $\{X_n\}$, $n \ge 0$. For the case of finite alphabet
finite order Markov chains similar questions have been studied by Bühlmann
and Wyner in [2]. However, the fact that we want to treat countable
alphabets complicates matters significantly. The point is that while finite
alphabet Markov chains have exponential rates of convergence of empirical
distributions, for countable alphabet Markov chains no universal rates are
available at all.

We encountered this problem in [16] where we gave a universal estimator
for the order of a Markov chain on a countable state space, and some of the
techniques that we use here have their origin in that paper. Before describing
our results in more detail let us define more precisely the class of processes
that we are considering. First let us fix the notation. Let
$\{X_n\}_{n=-\infty}^{\infty}$ be a stationary and ergodic time series taking values
from a discrete (finite or countably infinite) alphabet $\mathcal{X}$. (Note that
all stationary time series $\{X_n\}_{n=0}^{\infty}$ can be thought to be a two sided
time series, that is, $\{X_n\}_{n=-\infty}^{\infty}$.) For notational convenience, let
$X_m^n = (X_m, \dots, X_n)$, where $m \le n$. Note that if $m > n$ then $X_m^n$ is
the empty string.

For convenience let $p(x_{-k}^0)$ and $p(y|x_{-k}^0)$ denote the distribution
$P(X_{-k}^0 = x_{-k}^0)$ and the conditional distribution
$P(X_1 = y | X_{-k}^0 = x_{-k}^0)$, respectively.



Definition 1 For a stationary time series $\{X_n\}$ the (random) length
$K(X_{-\infty}^0)$ of the memory of the sample path $X_{-\infty}^0$ is the smallest
possible $0 \le K < \infty$ such that for all $i \ge 1$, all $y \in \mathcal{X}$, all
$z_{-K-i+1}^{-K} \in \mathcal{X}^i$,

$$p(y | X_{-K+1}^0) = p(y | z_{-K-i+1}^{-K}, X_{-K+1}^0)$$

provided $p(z_{-K-i+1}^{-K}, X_{-K+1}^0, y) > 0$, and $K(X_{-\infty}^0) = \infty$ if there
is no such $K$.



Definition 2 The stationary time series $\{X_n\}$ is said to be finitarily
Markovian if $K(X_{-\infty}^0)$ is finite (though not necessarily bounded) almost
surely.

This class includes of course all finite order Markov chains but also
many other processes such as the finitarily determined processes of Kalikow,
Katznelson and Weiss [10], which serve to represent all isomorphism classes of
zero entropy processes. For concrete instances that are not Markovian
consider the following example:

Example 1 Let $\{M_n\}$ be any stationary and ergodic first order Markov
chain with finite or countably infinite state space $S$. Let $s \in S$ be an
arbitrary state with $P(M_1 = s) > 0$. Now let $X_n = I_{\{M_n = s\}}$. By Shields
[21] Chapter I.2.c.1, the binary time series $\{X_n\}$ is stationary and ergodic.
It is also finitarily Markovian. Indeed, the conditional probability
$P(X_1 = 1 | X_{-\infty}^0)$ does not depend on values beyond the first (going
backwards) occurrence of one in $X_{-\infty}^0$, which identifies the first (going
backwards) occurrence of state $s$ in the Markov chain $\{M_n\}$. The resulting
time series $\{X_n\}$ is not a Markov chain of any order in general. Indeed,
consider the Markov chain $\{M_n\}$ with state space $S = \{0, 1, 2\}$ and
transition probabilities $P(M_2 = 1 | M_1 = 0) = P(M_2 = 2 | M_1 = 1) = 1$,
$P(M_2 = 0 | M_1 = 2) = P(M_2 = 1 | M_1 = 2) = 0.5$. This yields a stationary
and ergodic Markov chain $\{M_n\}$, cf. Example I.2.8 in Shields [21]. Clearly,
the resulting time series $X_n = I_{\{M_n = 0\}}$ will not be Markov of any order.
The conditional probability $P(X_1 = 0 | X_{-\infty}^0)$ depends on whether, until
the first (going backwards) occurrence of one, you see an even or an odd
number of zeros. These examples include all stationary and ergodic binary
renewal processes with finite expected inter-arrival times, a basic class for
many applications. (A stationary and ergodic binary renewal process is
defined as a stationary and ergodic binary process such that the times
between occurrences of ones are independent and identically distributed with
finite expectation, cf. Chapter I.2.c.1 in Shields [21].)
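
The parity effect in the three-state chain of Example 1 can be checked with a
short Python sketch. The function names, the choice to start the chain in
state 0 (rather than from the stationary distribution) and the seed are
illustrative assumptions, not part of the paper.

```python
import random

# Transition probabilities of the three-state chain from Example 1.
TRANSITIONS = {0: [(1, 1.0)], 1: [(2, 1.0)], 2: [(0, 0.5), (1, 0.5)]}


def simulate_indicator(n, seed=0):
    """Simulate n steps of the chain started at state 0 (an assumption made
    for simplicity) and return the indicator sequence X_t = 1{M_t = 0}."""
    rng = random.Random(seed)
    state, xs = 0, []
    for _ in range(n):
        xs.append(1 if state == 0 else 0)
        targets, weights = zip(*TRANSITIONS[state])
        state = rng.choices(targets, weights=weights)[0]
    return xs


def prob_next_is_one(zeros_since_last_one):
    """P(next X = 1) when the most recent one was followed by this many zeros.

    After a one the hidden state is 0; the zeros that follow force the hidden
    state to alternate 1, 2, 1, 2, ..., so only the parity of the count matters.
    """
    if zeros_since_last_one >= 2 and zeros_since_last_one % 2 == 0:
        return 0.5  # hidden state is 2, which returns to 0 with probability 1/2
    return 0.0      # hidden state is 0 or 1, and the next symbol must be zero


def zero_run_lengths(xs):
    """Lengths of the zero runs strictly between consecutive ones."""
    ones = [t for t, x in enumerate(xs) if x == 1]
    return [b - a - 1 for a, b in zip(ones, ones[1:])]
```

Because the number of zeros that must be inspected is unbounded, no fixed
finite window of the past suffices, which is why the indicator process is not
Markov of any order although its memory length is finite almost surely.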
We note that Morvai and Weiss [TB] proved that there is no classification
rule for discriminating the class of finitarily Markovian processes from other
ergodic processes.

For the finitarily Markovian processes an important notion is that of a
memory word which is defined as follows.

Definition 3 We say that $w_{-k+1}^0$ is a memory word if $p(w_{-k+1}^0) > 0$ and
for all $i \ge 1$, all $y \in \mathcal{X}$, all $z_{-k-i+1}^{-k} \in \mathcal{X}^i$,

$$p(y | w_{-k+1}^0) = p(y | z_{-k-i+1}^{-k}, w_{-k+1}^0)$$

provided $p(z_{-k-i+1}^{-k}, w_{-k+1}^0, y) > 0$.

Define the set $\mathcal{W}_k$ of those memory words $w_{-k+1}^0$ with length $k$, that is,
$\mathcal{W}_k = \{w_{-k+1}^0 \in \mathcal{X}^k : w_{-k+1}^0 \text{ is a memory word}\}$.

Our first result is a solution of the backward estimation problem, namely
determining the value of $K(X_{-\infty}^0)$ from observations of increasing length
of the data segments $X_{-n}^0$. We will give in the next section a universal
consistent estimator which will converge almost surely to the memory length
$K(X_{-\infty}^0)$ for any ergodic finitarily Markovian process on a countable
state space. The proofs that we give are fairly explicit, and given some
information on the average length of a memory word and the extent to which
the stationary distribution diffuses over the state space one could extract
rates for the convergence of the estimators from our estimates. We
concentrate, however, on the more universal aspects of the problem.

As is usual in these kinds of questions, the problem of forward estimation,
namely trying to determine $K(X_{-\infty}^n)$ from successive observations of
$X_0^n$, is more difficult. The stationarity means that results in probability
can be carried over automatically. However, almost sure results present
serious problems. For example, while Ornstein in [21] (cf. also Morvai et al.
[T2]) showed that there is a universal consistent estimator for the
conditional probability of $X_1$ given $X_{-\infty}^0$ based on successive
observations of the past, Bailey [1] showed that one simply cannot estimate
the forward conditional probabilities in a similar universal way. One can
obtain results modulo a zero density set of moments, but if one wants to be
sure that the estimates one gives eventually converge, one is forced to
resort to estimating along a sequence of stopping times (cf. Morvai [TT],
Morvai and Weiss [13], [13], [IS]). For some more results in this circle of
ideas of what can be learned about processes by forward observations see
Ornstein and Weiss [22], Dembo and Peres [6], Nobel [20], and Csiszár [4].

Recently in Csiszár and Talata [5] the authors define a finite context to
be a memory word $w$ of minimal length, that is, no proper suffix of $w$ is a
memory word. An infinite context for a process is an infinite string all of
whose finite suffixes have positive probability but none of which is a memory
word. They treat there the problem of estimating the entire context tree in
case the size of the alphabet is finite. For a bounded depth context tree,
the process is Markovian, while for an unbounded depth context tree the
universal pointwise consistency result there is obtained only for the truncated
trees which are again finite in size. This is in contrast to our results which
deal with infinite alphabet size and consistency in estimating memory words
of arbitrary length. This is what forces us to consider estimating at specially
chosen times.

In the succeeding two sections, §3 and §4, we will present two such schemes
which depend upon a positive parameter $\epsilon$, and we guarantee that the
sequence of times along which the estimates are being given has density at
least $1 - \epsilon$. The purpose of the two sections after these is to show that
this result is sharp in that the $\epsilon$ cannot be removed even in more
restricted classes of processes. In §5 we show that you cannot achieve
density one in forward estimation of the memory in the class of Markov chains
on countable alphabets, while in §6 we prove a similar negative result for
binary valued finitarily Markovian processes.

The last part of the paper is devoted to seeing how this memory length
estimation can be applied to estimating conditional probabilities. In §7 we
do this for finitarily Markovian processes along a sequence of stopping times
which achieve density $1 - \epsilon$. We do not know if the $\epsilon$ can be dropped in
this case for the estimation of conditional probabilities.

We can dispense with $\epsilon$ in the Markovian case. In §8 we use an earlier
result of ours on a universal estimator for the order of a finite order Markov
chain on a countable alphabet in order to estimate the conditional
probabilities along a sequence of stopping times of density one.
2 Backward Estimation of the Memory Length
for Finitarily Markovian Processes



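In order to estimate $K(X_{-\infty}^0)$ we need to define some explicit statistics.
The first is a measurement of the failure of $w_{-k+1}^0$ to be a memory word.

For $w_{-k+1}^0$ of positive probability define

$$\Delta_k(w_{-k+1}^0) = \sup_{i \ge 1}\ \sup_{\{z_{-k-i+1}^{-k} \in \mathcal{X}^i,\ x \in \mathcal{X}\ :\ p(z_{-k-i+1}^{-k}, w_{-k+1}^0, x) > 0\}} \left| p(x | w_{-k+1}^0) - p(x | z_{-k-i+1}^{-k}, w_{-k+1}^0) \right|.$$

Clearly this will vanish precisely when $w_{-k+1}^0$ is a memory word. We need to
define an empirical version of this based on the observation of a finite data
segment $X_{-n}^0$. To this end first define the empirical version of the
conditional probability as

$$\hat{p}_n(x | w_{-k+1}^0) = \frac{\#\{-n+k-1 \le t \le -1 : X_{t-k+1}^{t+1} = (w_{-k+1}^0, x)\}}{\#\{-n+k-1 \le t \le -1 : X_{t-k+1}^{t} = w_{-k+1}^0\}}.$$

These empirical distributions, as well as the sets we are about to introduce,
are functions of $X_{-n}^0$, but we suppress the dependence to keep the notation
manageable.

For a fixed $0 < \gamma < 1$ let $\mathcal{L}_k^n$ denote the set of strings with length
$k+1$ which appear more than $n^{1-\gamma}$ times in $X_{-n}^0$. That is,

$$\mathcal{L}_k^n = \{x_{-k}^0 \in \mathcal{X}^{k+1} : \#\{-n+k \le t \le 0 : X_{t-k}^{t} = x_{-k}^0\} > n^{1-\gamma}\}.$$

Finally, define the empirical version of $\Delta_k$ as follows:

$$\hat{\Delta}_k^n(w_{-k+1}^0) = \max_{1 \le i \le n}\ \max_{(z_{-k-i+1}^{-k}, w_{-k+1}^0, x) \in \mathcal{L}_{k+i}^n} \left| \hat{p}_n(x | w_{-k+1}^0) - \hat{p}_n(x | z_{-k-i+1}^{-k}, w_{-k+1}^0) \right|.$$

Let us agree by convention that if the smallest of the sets over which
we are maximizing is empty then $\hat{\Delta}_k^n = 0$. Observe that, by the
ergodic theorem, almost surely the empirical distributions $\hat{p}_n$
converge to the true distributions $p$ and so for any $w_{-k+1}^0 \in \mathcal{X}^k$,

$$\liminf_{n \to \infty} \hat{\Delta}_k^n(w_{-k+1}^0) \ge \Delta_k(w_{-k+1}^0) \quad \text{almost surely.}$$

With this in hand we can give a test for $w_{-k+1}^0$ to be a memory word. Let
$0 < \beta < \frac{1-\gamma}{2}$ be arbitrary. Let $NTEST_n(w_{-k+1}^0) = YES$ if
$\hat{\Delta}_k^n(w_{-k+1}^0) \le n^{-\beta}$ and $NO$ otherwise. Note that $NTEST_n$
depends on $X_{-n}^0$.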

Theorem 1 Eventually almost surely, $NTEST_n(w_{-k+1}^0) = YES$ if and only
if $w_{-k+1}^0$ is a memory word.

We define an estimate $\chi_n$ for $K(X_{-\infty}^0)$ from samples $X_{-n}^0$ as follows.
Set $\chi_0 = 0$, and for $n \ge 1$ let $\chi_n$ be the smallest $0 \le k < n$ such that
$NTEST_n(X_{-k+1}^0) = YES$ if there is such $k$ and $n$ otherwise.

Theorem 2 $\chi_n = K(X_{-\infty}^0)$ eventually almost surely.
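
The test and the estimator above can be sketched in Python as follows. This
is a minimal illustration under stated assumptions, not the authors' code:
the observed segment $X_{-n}, \dots, X_0$ is stored oldest-first in a list
`xs`; the function names and the defaults `gamma=0.5`, `beta=0.2` (which
satisfy $2\beta + \gamma < 1$) are hypothetical; and the normalisation used in
`p_hat` (occurrences of the word that still have a successor inside the
sample) is our reading of the empirical conditional probability.

```python
from collections import Counter


def block_counts(xs, length):
    """Count every contiguous block of the given length in xs."""
    return Counter(tuple(xs[s:s + length]) for s in range(len(xs) - length + 1))


def p_hat(xs, word, x):
    """Empirical probability that x follows `word` in xs (0.0 if never seen)."""
    word = tuple(word)
    joint = block_counts(xs, len(word) + 1)[word + (x,)]
    # Only occurrences of `word` that have a successor inside xs are counted.
    base = block_counts(xs[:-1], len(word))[word]
    return joint / base if base else 0.0


def delta_hat(xs, word, gamma):
    """Empirical failure of `word` to be a memory word (0.0 if nothing to compare)."""
    n, k, word = len(xs) - 1, len(word), tuple(word)
    threshold = n ** (1.0 - gamma)
    worst = 0.0
    for i in range(1, n + 1):
        blocks = block_counts(xs, k + i + 1)
        frequent = [s for s, c in blocks.items() if c > threshold]
        if not frequent:
            break  # a frequent longer block would have a frequent prefix
        for s in frequent:
            if s[i:i + k] != word:
                continue  # s is not of the form (z, word, x)
            x = s[-1]
            worst = max(worst, abs(p_hat(xs, word, x) - p_hat(xs, s[:-1], x)))
    return worst


def ntest(xs, word, gamma, beta):
    """True plays the role of YES: the word passes the memory-word test."""
    n = len(xs) - 1
    return delta_hat(xs, word, gamma) <= n ** (-beta)


def chi(xs, gamma=0.5, beta=0.2):
    """Smallest k < n whose length-k suffix of xs passes the test, else n."""
    n = len(xs) - 1
    for k in range(n):
        if ntest(xs, xs[len(xs) - k:], gamma, beta):
            return k
    return n
```

On a constant sequence the empty word already passes, so the estimate is 0;
on a strictly alternating sequence the empty word fails and the last symbol
passes, so the estimate is 1. The sketch recounts blocks on every call and is
meant for small samples only.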

In order to prove these theorems we need some lemmas. The first is a variant
of the simple fact that the states that follow the successive occurrences of
a fixed memory word $w$ are independent and identically distributed random
variables. We cannot use such a naive version because we are dealing with
a countable alphabet, and thus even the collection of memory words of a
fixed length is infinite. In order to cut down to a manageable set we would
like to consider only those words that appear in the sample $X_{-n}^0$, but now
the independence becomes a little subtler. This is the reason for the rather
forbidding looking formulas in the proof of the next lemma. What we do is
fix a location $(l-k, l]$ in the index set and then fix a memory word
$w_{-k+1}^0$ that occurs there together with a particular state $x$ that follows
it. The random times $l + \lambda_{l,k,i}^+$ and $l - \lambda_{l,k,i}^-$ are the other
occurrences of this memory word in the process. Here is the formal
definition. Set $\lambda_{l,k,0}^+ = 0$, $\lambda_{l,k,0}^- = 0$ and define

$$\lambda_{l,k,i}^+ = \lambda_{l,k,i-1}^+ + \min\{t > 0 : X_{l-k+1+\lambda_{l,k,i-1}^+ + t}^{l+\lambda_{l,k,i-1}^+ + t} = X_{l-k+1}^{l}\} \tag{1}$$

and

$$\lambda_{l,k,i}^- = \lambda_{l,k,i-1}^- + \min\{t > 0 : X_{l-k+1-\lambda_{l,k,i-1}^- - t}^{l-\lambda_{l,k,i-1}^- - t} = X_{l-k+1}^{l}\}. \tag{2}$$

Lemma 1 Assume $w_{-k+1}^0$ is a memory word and $x$ is a letter. Then for any
$i \ge 0$, $j \ge 0$ and $l$,

$$X_{l-\lambda_{l,k,i}^- + 1}, \dots, X_{l-\lambda_{l,k,1}^- + 1},\ X_{l+\lambda_{l,k,1}^+ + 1}, \dots, X_{l+\lambda_{l,k,j}^+ + 1}$$

are conditionally independent and identically distributed random variables
given $X_{l-k+1}^{l} = w_{-k+1}^0$, $X_{l+1} = x$, where the identical distribution is
$p(\cdot | w_{-k+1}^0)$.
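Proof: Fix the values $z_{-i}, \dots, z_{-1}$, $u_1, \dots, u_j$ and $x$ in the alphabet
and calculate

$$P\big(X_{l-\lambda_{l,k,i}^- + 1} = z_{-i}, \dots, X_{l-\lambda_{l,k,1}^- + 1} = z_{-1},\ X_{l+\lambda_{l,k,1}^+ + 1} = u_1, \dots, X_{l+\lambda_{l,k,j}^+ + 1} = u_j \,\big|\, X_{l-k+1}^{l} = w_{-k+1}^0,\ X_{l+1} = x\big).$$

In order to be able to use the fact that $w_{-k+1}^0$ is a memory word we will
shift back to the first occurrence at $l - \lambda_{l,k,i}^-$ and use the
stationarity. For each $t$,

$$\begin{aligned}
&P\big(X_{l-\lambda_{l,k,i}^- + 1} = z_{-i}, \dots, X_{l-\lambda_{l,k,1}^- + 1} = z_{-1},\ X_{l+\lambda_{l,k,1}^+ + 1} = u_1, \dots, X_{l+\lambda_{l,k,j}^+ + 1} = u_j, \\
&\qquad X_{l-k+1}^{l} = w_{-k+1}^0,\ X_{l+1} = x,\ \lambda_{l,k,i}^- = t\big) \\
&= P\big(X_{l-k+1}^{l} = w_{-k+1}^0,\ X_{l+1} = z_{-i},\ X_{l+\lambda_{l,k,1}^+ + 1} = z_{-i+1}, \dots, X_{l+\lambda_{l,k,i-1}^+ + 1} = z_{-1}, \\
&\qquad X_{l+\lambda_{l,k,i}^+ + 1} = x,\ X_{l+\lambda_{l,k,i+1}^+ + 1} = u_1, \dots, X_{l+\lambda_{l,k,i+j}^+ + 1} = u_j,\ \lambda_{l,k,i}^+ = t\big).
\end{aligned}$$

Summing over $t$, the joint probability of the $z$'s, the $u$'s,
$X_{l-k+1}^{l} = w_{-k+1}^0$ and $X_{l+1} = x$ equals

$$\begin{aligned}
&P\big(X_{l-k+1}^{l} = w_{-k+1}^0,\ X_{l+1} = z_{-i},\ X_{l+\lambda_{l,k,1}^+ + 1} = z_{-i+1}, \dots, X_{l+\lambda_{l,k,i-1}^+ + 1} = z_{-1}, \\
&\qquad X_{l+\lambda_{l,k,i}^+ + 1} = x,\ X_{l+\lambda_{l,k,i+1}^+ + 1} = u_1, \dots, X_{l+\lambda_{l,k,i+j}^+ + 1} = u_j\big).
\end{aligned}$$

Now telescoping the right hand side, conditioning at each successive
occurrence of the memory word on everything before it, we get

$$P(X_{l-k+1}^{l} = w_{-k+1}^0) \prod_{h=1}^{i} P(X_1 = z_{-h} | X_{-k+1}^0 = w_{-k+1}^0)\ P(X_1 = x | X_{-k+1}^0 = w_{-k+1}^0) \prod_{h=1}^{j} P(X_1 = u_h | X_{-k+1}^0 = w_{-k+1}^0). \tag{3}$$

Dividing by $P(X_{l-k+1}^{l} = w_{-k+1}^0, X_{l+1} = x) = P(X_{-k+1}^0 = w_{-k+1}^0) P(X_1 = x | X_{-k+1}^0 = w_{-k+1}^0)$
shows that the conditional probability we set out to calculate is the
product of the factors $P(X_1 = z_{-h} | X_{-k+1}^0 = w_{-k+1}^0)$ and
$P(X_1 = u_h | X_{-k+1}^0 = w_{-k+1}^0)$.

We have to prove that

$$P(X_1 = z_{-h} | X_{-k+1}^0 = w_{-k+1}^0) = P(X_{l-\lambda_{l,k,h}^- + 1} = z_{-h} | X_{l-k+1}^{l} = w_{-k+1}^0, X_{l+1} = x).$$

Indeed, by (3) (summed over the remaining coordinates) and stationarity,

$$\begin{aligned}
P(X_{l-\lambda_{l,k,h}^- + 1} = z_{-h} | X_{l-k+1}^{l} = w_{-k+1}^0, X_{l+1} = x)
&= \frac{P(X_{l-\lambda_{l,k,h}^- + 1} = z_{-h},\ X_{l-k+1}^{l} = w_{-k+1}^0,\ X_{l+1} = x)}{P(X_{l-k+1}^{l} = w_{-k+1}^0,\ X_{l+1} = x)} \\
&= \frac{P(X_{-k+1}^0 = w_{-k+1}^0)\, P(X_1 = z_{-h} | X_{-k+1}^0 = w_{-k+1}^0)\, P(X_1 = x | X_{-k+1}^0 = w_{-k+1}^0)}{P(X_{-k+1}^0 = w_{-k+1}^0)\, P(X_1 = x | X_{-k+1}^0 = w_{-k+1}^0)} \\
&= P(X_1 = z_{-h} | X_{-k+1}^0 = w_{-k+1}^0).
\end{aligned}$$

The same argument applies to the $u_h$. The proof of Lemma 1 is complete.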
Lemma 2 

P ( For some < k < n, -n + k - 1 < I < -1 : Xj^l^^ E Q+i, K{Xl^) < k, 
p„(Xj+i|X/_fc+i) -p(Xi+i|X/_fc+i) > n 



h= [ni-^J 



Proof: For a given $0 \le k \le n$, $-n+k-1 \le l \le -1$ assume that
$X_{l-k+1}^{l} = w_{-k+1}^0$, where $w_{-k+1}^0$ is a memory word. Since $w_{-k+1}^0$ is a
memory word, Lemma 1 together with Hoeffding's inequality (cf. Hoeffding [9]
or Theorem 8.1 of Devroye et al. [7]) for sums of bounded independent random
variables implies



--X} 



i+j 



-2n-2/5(i+j) 



Multiplying both sides by $P(X_{l-k+1}^{l+1} = (w_{-k+1}^0, x))$ and summing over all
possible memory words $w_{-k+1}^0$ and $x$ we get that



i+j 



- p{Xi^,\xi,^,: 



> n 



Summing over all pairs $(k, l)$ such that $0 \le k \le n$ and $-n+k-1 \le l \le -1$,
and over all pairs $(i, j)$ such that $i \ge 0$, $j \ge 0$, $i + j \ge \lfloor n^{1-\gamma} \rfloor$,
we complete the proof of Lemma 2.

Lemma 3 

P( ^max K{w\^,) > n-^) <n' h4e^^ . 



h=[ni-Tj 



Proof: 



P( ^max AUw'_,^,) > n'^) 

n 

< y^-P( max max 

n 

< -P( max max 

1=1 -fc+l^"^*: \^-k-i+l^"'-k+l'-^l^'-k+i 

n 

+ -P( max max 



Pn(a;kVi) -Pn(a;|2;_t 



i=l 



p{x\z'_ 



k-i+ly'^-k+l 
By Lemma 2, both terms inside the sum can be upper bounded by an
exponential, and summing over $i$ we get the statement and so the proof of
Lemma 3 is complete.

Proof of Theorem 1:

If $w_{-k+1}^0$ is not a memory word, then there are $z_{-k-i+1}^{-k}$ and $x$ such that
$p(x | w_{-k+1}^0) \ne p(x | z_{-k-i+1}^{-k}, w_{-k+1}^0)$ and
$p(z_{-k-i+1}^{-k}, w_{-k+1}^0, x) > 0$. By ergodicity,
$NTEST_n(w_{-k+1}^0) = NO$ eventually almost surely.

Assume $w_{-k+1}^0$ is a memory word. We will estimate the probability of the
undesirable event as follows: By Lemma 3,

P(A^(«;Vi) > ^-") < E Me^. 

h=[nl-Tj 

The right hand side is summable provided $2\beta + \gamma < 1$ and the Borel-Cantelli
Lemma yields that

$$P(\hat{\Delta}_k^n(w_{-k+1}^0) \le n^{-\beta} \text{ eventually}) = 1$$

and so $NTEST_n(w_{-k+1}^0) = YES$ eventually almost surely. The proof of
Theorem 1 is complete.

Proof of Theorem 2: Since $X_{-K(X_{-\infty}^0)+1}^0$ is a memory word and none
of its proper suffixes has this property, $\chi_n = K(X_{-\infty}^0)$ eventually
almost surely, by Theorem 1. The proof of Theorem 2 is complete.

3 Forward Estimation of the Memory Length 
for Finitarily Markovian Processes 

Define $PTEST_n(w_{-k+1}^0)(X_0^n) = NTEST_n(w_{-k+1}^0)(T^n X_0^n)$ where $T$ is the
left shift operator.

Theorem 3 Eventually almost surely, $PTEST_n(w_{-k+1}^0) = YES$ if and only
if $w_{-k+1}^0$ is a memory word.

Define a list of words $\{w(0), w(1), w(2), \dots, w(n), \dots\}$ such that all words
of all lengths are listed and a word cannot precede its suffix. Note that
$w(0)$ is the empty word.
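
One possible listing with the required property can be sketched in Python.
The particular ordering (by "stage" and then by length) and the encoding of
the countable alphabet as the nonnegative integers are assumptions made for
illustration; the text only requires that every word be listed and that no
word precede any of its suffixes.

```python
from itertools import count, islice, product


def enumerate_words():
    """Yield every finite word over {0, 1, 2, ...} exactly once, suffixes first.

    A nonempty word is emitted at stage m = max(len(word) - 1, max(word)).
    A proper suffix is shorter and uses no larger symbol, so its stage is
    not later; within a stage shorter words are emitted first.
    """
    yield ()  # w(0) is the empty word
    for m in count(0):
        for length in range(1, m + 2):
            for word in product(range(m + 1), repeat=length):
                if max(length - 1, max(word)) == m:
                    yield word


# The first few words of this hypothetical listing.
first_words = list(islice(enumerate_words(), 7))
```

With this ordering the listing begins with the empty word, then `(0,)`,
`(1,)`, `(0, 0)`, `(0, 1)`, `(1, 0)`, `(1, 1)`.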






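Now define sets of indices as follows. Let $A_n^0 = \{0, 1, \dots, n\}$ and for
$i > 0$ define

$$A_n^i = \{|w(i)| - 1 \le j \le n : X_{j-|w(i)|+1}^{j} = w(i)\}. \tag{4}$$

Let $\epsilon > 0$ be fixed. Define $\theta_n(\epsilon) \le n$ to be the minimal $j$ such that

$$\frac{\left| \bigcup_{i \le j :\ PTEST_n(w(i)) = YES} A_n^i \right|}{n+1} > 1 - \epsilon/2 \tag{5}$$

and $n$ if no such $j$ exists. We estimate for the length of the memory of
$X_{-\infty}^n$ looking backwards if
$n \in \bigcup_{i \le \theta_n(\epsilon),\ PTEST_n(w(i)) = YES} A_n^i$. The set of $n$'s
for which this holds will be the set for which we estimate the memory and
we denote this set by $\mathcal{N}$. Note that the event $n \in \mathcal{N}$ depends only
on $X_0^n$, and thus $\mathcal{N}$ can be thought of as a sequence of stopping times.
We define for $n \in \mathcal{N}$,

$$\kappa_n = \min\{t \ge 0 : X_{n-|w(t)|+1}^{n} = w(t),\ PTEST_n(w(t)) = YES\}.$$

For $n \in \mathcal{N}$ define

$$\rho_n(X_0^n) = |w(\kappa_n)|.$$

Note that $\rho_n$, $\theta_n$, $\kappa_n$ and $\mathcal{N}$ depend on $\epsilon$, however, we will not
denote this dependence on $\epsilon$ explicitly.

Theorem 4 Let $\epsilon > 0$ be fixed. Then for $n \in \mathcal{N}$,

$$\rho_n = K(X_{-\infty}^n) \quad \text{eventually almost surely,} \tag{6}$$

and

$$\liminf_{n \to \infty} \frac{|\mathcal{N} \cap \{0, 1, \dots, n-1\}|}{n} \ge 1 - \epsilon. \tag{7}$$

For $n \in \mathcal{N}$, $w(\kappa_n)$ appears at least $n^{1-\gamma}$ times eventually almost
surely.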



Proof of Theorem 3: Since the proof of Theorem 1 was based on a
Borel-Cantelli lemma, the time shift in defining $PTEST_n$ makes no difference
and we literally copy the proof of Theorem 1. The proof of Theorem 3 is
complete.

Proof of Theorem 4: There is a large enough $N$ such that

$$P(K(X_{-\infty}^0) < N) > 1 - \epsilon/4.$$

The sequence $\theta_n$ is bounded along individual sequences of the process with
probability one. (This may be seen by first choosing a sufficiently large finite
set $\{w(0), \dots, w(M)\}$ of memory words so that the probability of seeing at
least one of them in position zero is greater than $1 - \epsilon/4$; then, applying
Theorem 3 and the ergodic theorem, we see that almost surely for all
sufficiently large $n$, $\theta_n \le M$. This implies of course that $\theta_n$ is bounded
pointwise as claimed.) Thus by Theorem 3,

Pn = K{XU provided G WKix^^^) n{w{0), . . . , w{On)} 

eventually almost surely. We have proved the consistency. Let J denote the 
smallest j such that 

t P(XVi G m n{^(0), ■ ■ ■ , >!-'-. 



It is obvious from the definition above that 

J2 P(xVi e m f]M0), ...,w{J-i)})<i 

k=0 

Thus On > J eventually almost surely. Thus 

|Arn{0,l,...,n}| 



lim inf 

n^oo n+1 



> lim inf 



> 1 almost surely. 



n + 1 

We have proved that $\mathcal{N}$ has density at least $1 - \epsilon/2$. Since $\theta_n$ is
bounded, for $n \in \mathcal{N}$ eventually, $w(\kappa_n)$ appears at least $n^{1-\gamma}$ times.
The proof of Theorem 4 is complete.



4 Another Approach to Estimating the Memory
Length for Finitarily Markovian Processes

In the preceding section we made use of the fact that the proof that we gave
for the backward memory estimator was via a rough probability estimate and
the Borel-Cantelli lemma. This enabled us to copy it directly for the forward
estimation. In this section we shall show that any successful backward
memory estimator can be used to get the same kind of result. We will denote
by $\chi_n$ some fixed consistent backward estimator for the memory length such
as the $\chi_n$ of §2. To this end, based on the successive forward samples we
construct many infinite sample points of the process.
To construct a sample of the $X_{-\infty}^0$ process from the forward data segment
$X_0^n$, we use the procedure that we used in Morvai and Weiss [U]. Begin with
$X_0$, then look for its first recurrence, i.e. the minimum $t_0 > 0$ such that
$X_{t_0} = X_0$, and then extend $X_0$ to the left by adding $X_{t_0 - 1}$. Next look
for the first recurrence of $X_{t_0 - 1} X_{t_0}$ in a position $t_1 > t_0$, i.e.
$X_{t_1 - 1} X_{t_1} = X_{t_0 - 1} X_{t_0}$, and then again extend to the left by adding
$X_{t_1 - 2}$, obtaining $X_{t_1 - 2} X_{t_1 - 1} X_{t_1}$ as the first three symbols of our
sample for the backward process. We will denote this by
$\tilde{X}_{-2}^0 = X_{t_1 - 2}^{t_1}$. Continuing in this way, we can develop from
$X_0^{\infty}$ a point $\tilde{X}_{-\infty}^0$ which we shall show has the same distribution
as $X_{-\infty}^0$. We need to do this starting at each $i \ge 0$. Here are the
formulas that accomplish this end.
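
The recurrence construction just described can be sketched in Python for a
finite forward sample. The function name `backward_sample`, the list
representation of the data and the decision to stop as soon as the current
block has no further recurrence inside the sample are illustrative
assumptions.

```python
def backward_sample(xs, i=0):
    """Build the backward sample started at position i, oldest symbol first.

    Starting from xs[i], repeatedly find the first recurrence of the current
    block further to the right and prepend the symbol that precedes that
    recurrence. The result grows by one symbol per recurrence found.
    """
    xs = list(xs)
    sample = [xs[i]]   # the symbol at time 0 of the backward sample
    end = i            # index in xs where the current block ends
    n = 1              # length of the current block
    while True:
        block = xs[end - n + 1:end + 1]
        # First shift t > 0 at which the same block ends at position end + t.
        t = next(
            (t for t in range(1, len(xs) - end)
             if xs[end + t - n + 1:end + t + 1] == block),
            None,
        )
        if t is None:
            break      # no further recurrence inside the finite sample
        end += t
        sample.insert(0, xs[end - n])  # symbol just before the recurrence
        n += 1
    return sample
```

For the forward sample `[0, 1, 0, 0, 1, 0, 1, 1, 0]` and `i=0`, the symbol 0
recurs at index 2 (prepend the 1 at index 1), the block `1 0` recurs ending
at index 5 (prepend the 0 at index 3), and the block `0 1 0` does not recur
again, so the sketch returns `[0, 1, 0]`.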

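For $i = 0, 1, \dots$ define auxiliary stopping times. Set $\zeta_{-1}(i) = -\infty$ and
$\zeta_0(i) = 0$. For $n = 1, 2, \dots$, let

$$\zeta_n(i) = \zeta_{n-1}(i) + \min\{t > 0 : X_{i+\zeta_{n-1}(i)-(n-1)+t}^{i+\zeta_{n-1}(i)+t} = X_{i+\zeta_{n-1}(i)-(n-1)}^{i+\zeta_{n-1}(i)}\}. \tag{8}$$

Among other things, using $\zeta_n(i)$ we can define very useful processes
$\{\tilde{X}_n(i)\}_{n=-\infty}^{0}$ as a function of $X_0^{\infty}$ as follows.
Define

$$\tilde{X}_{-n}(i) = X_{i+\zeta_n(i)-n}. \tag{9}$$

It is clear that in this way we defined processes
$\{\tilde{X}_n(i)\}_{n=-\infty}^{0}$. We will see that $\{\tilde{X}_n(i)\}_{n=-\infty}^{0}$ has the
same distribution as the original process, and for now assume that this is
so.
Let

$$\eta_n(i) = \max\{j \ge -1 : i + \zeta_j(i) \le n\}. \tag{10}$$

Note that $(\tilde{X}_{-\eta_n(i)}(i), \dots, \tilde{X}_0(i))$ is measurable with respect to
$X_0^n$.

Define $\rho_n^i = \chi_{\eta_n(i)}(\tilde{X}_{-\infty}^0(i))$ if $\eta_n(i) \ge 0$ and $\rho_n^i = 0$
otherwise. Note that $\rho_n^i$ is also measurable with respect to $X_0^n$.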
Define sets of indices as follows.

K = {Pn<J<n: Xl_^^, = (11) 

For any fixed $i$, eventually, $\tilde{X}_{-\rho_n^i+1}^0(i)$ is a memory word, so the
sets $A_n^i$ are simply the places where this fixed word occurs. Let $\epsilon > 0$
be fixed. Define $\theta_n(\epsilon)$ to be the minimal $j$ such that



n 



>l-e/2. (12) 



We estimate for the order of $X_{-\infty}^n$ looking backwards if
$n \in \bigcup_{i \le \theta_n(\epsilon)} A_n^i$. The set of $n$'s for which this holds will be
the set for which we estimate the memory and we denote this set by
$\mathcal{N}$. Note that the event $n \in \mathcal{N}$ depends only on $X_0^n$, and thus
$\mathcal{N}$ can be thought of as a sequence of stopping times.
In case $n \in \bigcup_{i \le \theta_n(\epsilon)} A_n^i$ define



mm-^ 



Note that and H depend on e, however, we will not denote this 

dependence on e explicitly. 

Theorem 5 Let e > 6e fixed. Then for n G M , 

p'^ — K{X'^^ eventually almost surely, (13) 

and 

liminfl^^<°'^— "-^>l>l-e. (14) 

For n G Af, X^_^«n appears at least n^'"' times eventually almost surely. 

Lemma 4 For all i the time series {Xn{i)}n=~oo ^'^^ {^n}n=-oo have iden- 
tical distribution. 

Proof: For all A; > 1 and 1 <i < k define Co = 

cf = - min{t > : ^|;:;_,_, = ^|;_(._,}- 

Let T denote the left shift operator, that is, {Tx'^^)i = Xj+i. It is easy to 
see that if and only if Cit(0(^-oo) = ^ then Cfe (2"^'+'^2;~^) = Now the 
statement follows from stationarity and the fact that for $k \ge 0$,
$x_{-k}^0 \in \mathcal{X}^{k+1}$, $l \ge 0$,

T'^'{x^(^i-k = a(^) = i} = {X', = x\ cj:{x'_j = -I}. (15) 

The proof of Lemma 4 is complete.

Lemma 5 If $P(X_0^n = w_0^n) > 0$ for the string $w_0^n$ then almost surely,

$$\tilde{X}_{-n}^0(i) = w_0^n \quad \text{for some } i. \tag{16}$$

Proof: Let $t$ denote the $(n+1)$-th occurrence of the string $w_0^n$ in
$X_0^{\infty}$. It is easy to see that there must be a $0 \le i \le t$ such that

and so 

= <■ 

The proof of Lemma [5] is complete. 

Proof of Theorem 5: There is a large enough $N$ such that

P(K(X°^) < iV) > 1 - e/4. 

Then by Lemma 5 and ergodicity, $\theta_n$ is a bounded sequence (cf. the proof of
Theorem 4). By Lemma 4 and Theorem 2,

pt. = i^(XO^(^))for allz = l,...,^„ 

eventually almost surely. We have proved (13). We have to prove (14). Let
$J$ denote the smallest $j$ such that

It is obvious from the definition above that 

i=0 

Thus 9n> J eventually almost surely. Thus 

|Arn{0,l,...,n}| 



e 

2' 



lim inf 



n + 1 



> lim inf _ _ 

n^oo n+1 2 



> 1 almost surely. 



The proof of Theorem 5 is complete.
5 Memory Estimation for Markov Processes



In this section we shall examine how well one can estimate the local memory
length for finite order Markov chains. In the case of finite alphabets this can
be done with stopping times that eventually cover all time epochs. (Indeed,
assume $\{X_n\}$ is a Markov chain taking values from a finite set. Assume
$ORDEST_n$ estimates the order in a pointwise sense from data $X_0^n$, e.g. as in
Csiszár and Shields [3] or in Morvai and Weiss [16]. Then let

$$\rho_n = \min\{0 \le t \le ORDEST_n : PTEST_n(X_{n-t+1}^{n}) = YES\}$$

if there is such $t$ and $0$ otherwise. Since $ORDEST_n$ eventually gives the
right order and there are finitely many possible strings with length not
greater than the order, $\rho_n = K(X_{-\infty}^n)$ eventually almost surely by
Theorem 3.) However, as soon as one goes to a countable alphabet, even if the
order is known to be two and we are just trying to decide whether $X_n$ alone
is a memory word or not, there is no sequence of stopping times which is
guaranteed to succeed eventually and whose density is one. This shows that
the $\epsilon$ in the preceding sections cannot be eliminated.

Theorem 6 There is no strictly increasing sequence of stopping times
$\{\lambda_n\}$ and estimators $\{h_n(X_0, \dots, X_{\lambda_n})\}$ taking the values one and
two, such that for all countable alphabet Markov chains of order two:

$$\lim_{n \to \infty} \frac{\lambda_n}{n} = 1$$

and

$$\lim_{n \to \infty} |h_n(X_0, \dots, X_{\lambda_n}) - K(X_0^{\lambda_n})| = 0 \quad \text{with probability one.}$$

To prove the theorem we will assume that such a pair of stopping times 
and estimators exist and construct a Markov chain {Xn} of order two for 
which they fail. The Markov chain of order two that we construct will have 
for its state space the nonnegative integers N , and it will be a perturbation 
of the 1-step Markov chain Zn defined by the following formulae: 

Ps,s+r = 'i'"'^ for all r > 1, 
and

Psj = 2-^-2 for all 0<j<s. 

Notice that from any state s > there is a fixed probability of | of going 
to 0. Also there is a strictly positive probability of going from any state 
to any other. These properties ensure that there is a finite stationary
measure. The ultimate chain will preserve most of these conditional
probabilities, with the difference depending on a sequence of integers
$t_k \gg k$ which will be defined later. The perturbed chain $\{X_n\}$ will have
the same transition probabilities as the original chain $\{Z_n\}$ for $X_n$ given
$(X_{n-2}, X_{n-1})$ when the latter, $(X_{n-2}, X_{n-1})$, equals any pair $(t, s)$ with
the exception of $(t_k, k)$ for $k \ge 0$. In that case we will modify the
probability of the transitions to $k$ and $k+1$ by interchanging the values of
$P_{k,k}$ and $P_{k,k+1}$. As soon as the first change is made $0$ ceases to be a
memory word and therefore the order of the new chain is two. Eventually all
singletons cease to be memory words.

The $t_k$'s will be chosen inductively in a fashion depending on the
purported sequence of stopping times and estimators that we are trying to show
cannot exist. At the $k$-th stage we will have only made these changes up to
$k$. Let us denote the Markov process of order 2 that this defines by
$\{Y_n^{(k)}\}$. More explicitly the process $\{Y_n^{(k)}\}$ is defined as follows. It
has transition probabilities given by: for all $j \le k$

$$P(Y_{n+2}^{(k)} = j+1 | Y_n^{(k)} = t_j, Y_{n+1}^{(k)} = j) = P(Z_{n+2} = j | Z_{n+1} = j),$$

$$P(Y_{n+2}^{(k)} = j | Y_n^{(k)} = t_j, Y_{n+1}^{(k)} = j) = P(Z_{n+2} = j+1 | Z_{n+1} = j),$$

and for all other values of $(u, t, s)$ we have

$$P(Y_{n+2}^{(k)} = u | Y_n^{(k)} = t, Y_{n+1}^{(k)} = s) = P(Z_{n+2} = u | Z_{n+1} = s).$$

Thus for this process, all singletons $j$ for $j > k$ are still memory words of
length one, but none of the $j$ with $0 \le j \le k$ are. The main technical
lemma that we will need is that the distributions of finite blocks up to some
preassigned length $N$ of the $\{Y_n^{(k)}\}$ and $\{Y_n^{(k+1)}\}$ processes are
arbitrarily close if $t_{k+1}$ is chosen sufficiently large. This is independent
of the other properties of $t_k$ that are needed, and so we begin by
establishing this fact.

Our proof will be via a coupling argument. We will calculate the finite 
distributions of these two processes by calculating time averages of a pair of 
typical sequences generated by transition matrices starting from the pair 00. 
The coupling is especially easy since for both processes , from any state there 
is a fixed probability of moving to 00 of at least | and therefore no matter 
how the two sequences diverge if we continue their evolution independently 
there is at every moment a fixed probability, namely ^, of the processes 
returning simultaneously to 00. 

Lemma 6 With the definitions above for $\{Y_n^{(k)}\}$ and $\{Y_n^{(k+1)}\}$, if $N$
and $\delta > 0$ are arbitrary, for any choice of $t_{k+1}$ that is sufficiently large
we will have that the variational distance between the distributions of
$(Y_0^{(k)}, \dots, Y_N^{(k)})$ and $(Y_0^{(k+1)}, \dots, Y_N^{(k+1)})$ is at most $\delta$.

Proof: Since the $\{Y_n^{(k)}\}$ is fixed at the start, given $N$ and $\delta$ we can
choose $T$ sufficiently large so that the stationary probability
$\pi^{(k)}(t) < \gamma$ for any $t > T$, where $\gamma = \delta/(64 + 2N)$.

Suppose that we choose $t_{k+1} > T$. We begin the coupling by starting
each of the processes at the pair 00. Denote by $u_j$ and $v_j$ the random
sequences constructed by applying the transition functions for the two
processes $\{Y_n^{(k)}\}$, $\{Y_n^{(k+1)}\}$ respectively. Until $u_j = t_{k+1}$ for the
first time the sequences can be taken to be identical, since they have the
same transition probabilities for pairs that do not include this state. We
denote by $\sigma_1$ this moment, and continue the coupling now independently,
waiting for the first moment $j > \sigma_1$ that the equality
$(u_j, u_{j+1}) = (v_j, v_{j+1}) = (0, 0)$ holds. Call this moment $\tau_1$. Notice that
$\sigma_1$ is a function of the $u_j$'s while $\tau_1$ is a function of both processes.
Beginning with $\tau_1$ we can once again continue the evolution in an identical
fashion until the first moment $j > \tau_1$ that $u_j = t_{k+1}$. Call that stopping
time $\sigma_2$. Note that this stopping time also depends on both processes. As
before, as soon as this happens continue the processes independently until
the first moment $j > \sigma_2$ that the equality
$(u_j, u_{j+1}) = (v_j, v_{j+1}) = (0, 0)$ holds. It should now be clear how this
is continued to build (with probability one) typical sequences for the two
processes. In order to compare the stationary distributions of words up to
length $N$ in the two processes we need to know the relative frequency of the
periods when we are coupling independently compared to the periods when we
are producing the same symbols.

The asymptotic frequency of the occurrence of $t_{k+1}$ in the $u_j$ sequence
is known to be at most $\gamma$, and at the stopping times $\sigma_i$,
$u_{\sigma_i} = t_{k+1}$. The gaps $\tau_i - \sigma_i$ are independent for different $i$'s
and have a length which has
a geometric distribution with fixed parameter ^ as we remarked earlier. 
Thus the average fraction of the time that the A^-strings in the u and v 
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sequences do not match exactly is at most (64 + 2N)^. It follows that the 
variational distance between iYQ'\ . . . , Y^^) and {Yq^^^\ . . . , Y^'^^'') is at 
most (64 + 2N)'j = 6 and thus the the lemma has been established. 
We can now give the

Proof of Theorem 6

Suppose that there does exist a sequence of stopping times and estimators 
as in the statement of the theorem. We begin with the one step Markov chain 
$Z_n$ described above and observe that the state 0 has a positive stationary
probability. Since the A^'s have density one we can find an A^^o so that with 
probability at least 1 — ^ in the string there will be some A„ < A^o with 
Zx„ = and 

$h_n(Z_0, \ldots, Z_{\lambda_n}) = 1$

We can apply the lemma with N = Nq and S = j^to find a suitable to with 
which we can define a $\{Y_n^{(0)}\}$ process in which 0 is now not a memory word,
so that for those strings where $Z_{\lambda_n} = 0$ and $h_n(Z_0, \ldots, Z_{\lambda_n}) = 1$ a definite
mistake is being made. Such strings with length Nq still have probability 
at least 1 — ^. Having defined {K^^"-*} we notice now that the state 1 is 
still a memory word of length one with positive stationary probability, and 
therefore we can find an A^^i sufficiently large so that with probability at least 
1 — in the string {Yq^\ . . . , Yj^^) there will be some A„ < Ni with Y^^^-* = 1 
and 

As before we apply the lemma with N = Ni and S = 10~^ to find a 
suitable ti with which we can define the next process For this process 

although 1 fails to be a memory word we still can estimate the probability
that in a $(Y_0^{(1)}, \ldots, Y_{N_1}^{(1)})$ string there will be some $\lambda_n < N_1$ with $Y_{\lambda_n}^{(1)} = 1$
and

$h_n(Y_0^{(1)}, \ldots, Y_{\lambda_n}^{(1)}) = 1$

as being at least 1—2 x 10^^. In addition, the previous estimate on strings 
of length A^o is degraded only by 10~^, since the estimate on the variational 
distance descends to strings of shorter length. By now it should be clear how
to continue the inductive construction of the $t_k$'s. The ultimate process that
we obtain, which we may denote simply by $\{X_n\}$, is of course a Markov chain
of order two, and it has no memory words of length one at all. However, for
every $k$, the probability that there will be some $\lambda_n < N_k$ with $X_{\lambda_n} = k$ and

$h_n(X_0, \ldots, X_{\lambda_n}) = 1$
will be at least 1 — g^Qk- The Borel-Cantelli lemma implies that with 
probability one there will be infinitely many mistakes made by our
estimator, contrary to the assumption. This concludes the proof of Theorem 6.

6 Limitations for Binary Finitarily Markovian Processes

In the preceding section we showed that we cannot achieve density one in
the forward memory length estimation problem even in the class of Markov
chains on a countable alphabet. In this section we shall show something
similar in the class of binary (i.e. {0, 1}-valued) finitarily Markovian processes.
To prove this we will assume that we are given a sequence of estimators and
stopping times $(h_n, \lambda_n)$ that succeed in estimating the memory
length for binary Markov chains of finite order, and we construct a finitarily
Markovian binary process on which the scheme fails infinitely often. This
differs from the proof outline of the previous section. There a contradiction
was reached showing that the purported estimators do not exist. In the
present case, as we remarked in the opening paragraph of §5, there does exist
a sequence of estimators $h_n$ which eventually succeed in giving the memory
length almost surely for all binary Markov chains of finite order. Here is a
precise statement:

Theorem 7 For any strictly increasing sequence of stopping times $\{\lambda_n\}$ and
sequence of estimators $\{h_n(X_0, \ldots, X_{\lambda_n})\}$, such that for all stationary and
ergodic binary Markov chains with arbitrary finite order, $\lim_{n \to \infty} \lambda_n / n = 1$ and

$\lim_{n \to \infty} |h_n(X_0, \ldots, X_{\lambda_n}) - K(X_0^{\lambda_n})| = 0$ almost surely,

there is a stationary, ergodic finitarily Markovian binary time series such
that on a set of positive measure of process realizations

$h_n(X_0, \ldots, X_{\lambda_n}) \neq K(X_{-\infty}^{\lambda_n})$

infinitely often.
Proof:

First we define the same Markov chain as in Ryabko [23] (cf. also Györfi,
Morvai, Yakowitz [8], Morvai and Weiss [17]), which serves as the technical
tool for the construction of our counterexample. Let the state space $S$ be the
non-negative integers and define the transition probabilities as follows:

$p_{0,1} = p_{1,2} = 1$, and for all $s > 1$: $p_{s,0} = p_{s,s+1} = \frac{1}{2}$.

This construction yields a stationary and ergodic Markov chain $\{M_i\}$ with
stationary distribution

$P(M = 0) = P(M = 1) = \frac{1}{4}$

and

$P(M = i) = \frac{1}{2^i}$ for $i \ge 2$.
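The stated stationary distribution can be checked directly against the balance equations of the chain. The following is a minimal sketch in exact rational arithmetic; the names `pi` and `inflow_to_zero` and the truncation level `S` are illustrative choices made here, not part of the construction.

```python
from fractions import Fraction

def pi(i):
    """Claimed stationary probability of state i: 1/4 for i in {0, 1}, 2^-i for i >= 2."""
    return Fraction(1, 4) if i in (0, 1) else Fraction(1, 2 ** i)

def inflow_to_zero(max_state):
    """Probability flow into state 0 from states 2..max_state (each jumps to 0 w.p. 1/2)."""
    return sum(pi(s) * Fraction(1, 2) for s in range(2, max_state + 1))

S = 50  # truncation level (an arbitrary choice for this check)

# Balance at states 1 and 2: all mass of the predecessor moves forward.
balance_1 = pi(1) == pi(0)
balance_2 = pi(2) == pi(1)
# Balance at states s + 1 >= 3: half of the mass at s moves forward.
balance_tail = all(pi(s + 1) == pi(s) * Fraction(1, 2) for s in range(2, S))
# Balance at state 0: the truncated inflow misses pi(0) by exactly the tail 2^-(S+1).
gap_at_zero = pi(0) - inflow_to_zero(S)
# Total mass on states 0..S, which misses 1 by the tail 2^-S.
total_mass = sum(pi(i) for i in range(S + 1))
```

Both truncation gaps tend to zero as `S` grows, which is consistent with the distribution above being stationary and summing to one.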

We shall construct a finitarily Markovian process $X_n$ by defining a certain
function $f$ from the state space $S$ to $\{0, 1\}$ and setting $X_n = f(M_n)$. We
will ensure that it is finitarily Markovian by taking care that $f(0) = f(1) = 0$,
$f(2) = 1$ and for all $s > 2$, if $f(s) = 0$ then $f(s + 1) = 1$. Thus, in
the $X_n$ process whenever one observes two successive zeroes and a one, it is
known that the underlying states in the Markov chain were 012. Note that
if there is an integer $K$ such that $f(i) = 1$ for all $i > K - 1$ then the process
$\{X_n\}$ is a binary Markov chain with order not greater than $K$. (Indeed,
the probabilities $P(X_n = 1 \mid X_0, \ldots, X_{n-1})$ are determined by the last $K$ bits.)

We will define $f$ in stages using the stopping times and estimators that the
hypotheses of the theorem give us. At stage $j$ there will be an $f^{(j)}$ and we will
define a binary-valued process $\{X_n^{(j)}\}$ by the formula $X_n^{(j)} = f^{(j)}(M_n)$, where
$f^{(j)}$ will be a $\{0, 1\}$-valued function of the state space $S$ which is eventually
one. As remarked, this ensures that all these processes are actually finite
order Markov chains. The desired $f$ will be the limit of these $f^{(j)}$'s and it
will take the value 0 infinitely often. Now for the definition.
For all $0 \le j < \infty$, set $f^{(j)}(0) = 0$, $f^{(j)}(1) = 0$, and $f^{(j)}(2) = 1$.

Define $f^{(0)}(k) = 1$ for all $k \ge 3$. Hence, since $f^{(0)}(i)$ is eventually 1, the
process $\{X_n^{(0)} = f^{(0)}(M_n)\}$ is a stationary ergodic binary Markov chain with
order $k_0 \le 3$.
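To make the stage-0 process concrete, here is a minimal simulation sketch. It assumes the chain moves 0 to 1 to 2 with probability one and, from any state s at least 2, moves to 0 or to s + 1 with probability 1/2 each. The names `step`, `sample_states` and `f0` are illustrative, and the path is started at state 0 rather than from the stationary distribution.

```python
import random

def step(state, rng):
    """One transition: 0 -> 1 -> 2 surely; from s >= 2 go to 0 or s + 1 with probability 1/2 each."""
    if state in (0, 1):
        return state + 1
    return 0 if rng.random() < 0.5 else state + 1

def sample_states(n, rng, start=0):
    """Sample M_0, ..., M_{n-1}, started at `start` (not at stationarity)."""
    states = [start]
    while len(states) < n:
        states.append(step(states[-1], rng))
    return states

def f0(s):
    """Stage-0 labelling: 0 on states 0 and 1, and 1 on every state >= 2."""
    return 0 if s in (0, 1) else 1

rng = random.Random(0)  # arbitrary seed, only for reproducibility
states = sample_states(1000, rng)
bits = [f0(s) for s in states]  # the binary process X_n = f0(M_n)
```

In `bits`, zeros come only in pairs (states 0 and 1) and each pair is followed by a one (state 2), which is the pattern 001 that reveals the underlying states 012.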

Recalling the stopping times and estimators, define the event

$A_1(t_1, s_1) = \{$for some $n$: $h_n(f^{(0)}(M_0), \ldots, f^{(0)}(M_{\lambda_n})) \le k_0$,

$f^{(0)}(M_i) = 1$ for $\lambda_n - k_0 + 1 \le i \le \lambda_n$, $t_1 \le \lambda_n \le s_1\}$.

Notice that this is a well defined event in the sample space of the Markov
chain $M_n$. All of the events that we are about to define are in that one fixed
sample space, only the function $f^{(j)}$ will be changing. By the hypotheses of
the theorem there are sufficiently large $s_1 > t_1 > 3$ such that the probability

$P(A_1(t_1, s_1) \mid M_0 = 0, M_1 = 1, M_2 = 2) > 1 - 2^{-1}$.

Let $f^{(1)}(i) = f^{(0)}(i)$ for $i = 0, 1, \ldots, s_1$ and let $f^{(1)}(s_1 + 1) = 0$, $f^{(1)}(i) = 1$
for $i \ge s_1 + 2$. It is clear that the memory of a sequence with suffix a string of 1's of length $k_0$
in the process $X_n^{(1)} = f^{(1)}(M_n)$ is greater than $k_0$, but if the event $A_1$ occurs
then the estimator will commit an error at least once in the interval $[t_1, s_1]$.
The new process has an order $k_1 \le s_1 + 3$. We will continue in this manner
inductively. Assuming that we have already defined $k_j, t_j, s_j, A_j$ we will now
go to stage $j + 1$ and show how to update these parameters.
Let $s_{j+1} > t_{j+1} > s_j + 3$ be chosen such that for the event

$A_{j+1}(t_{j+1}, s_{j+1}) = \{$for some $n$: $h_n(f^{(j)}(M_0), \ldots, f^{(j)}(M_{\lambda_n})) \le k_j$,
$f^{(j)}(M_i) = 1$ for $\lambda_n - k_j + 1 \le i \le \lambda_n$, $t_{j+1} \le \lambda_n \le s_{j+1}\}$

we have:

$P(A_{j+1}(t_{j+1}, s_{j+1}) \mid M_0 = 0, M_1 = 1, M_2 = 2) > 1 - 2^{-(j+1)}$.

Set now $f^{(j+1)}(i) = f^{(j)}(i)$ for $i = 0, 1, \ldots, s_{j+1}$ and let $f^{(j+1)}(s_{j+1} + 1) = 0$

and $f^{(j+1)}(i) = 1$ for $i \ge s_{j+1} + 2$. It is clear that the memory of a sequence
with suffix a string of 1's of length $k_j$ in the process $X_n^{(j+1)} = f^{(j+1)}(M_n)$ is
greater than $k_j$, and if the event $A_{j+1}$ happens then the estimator will commit
an error at least once in the interval $[t_{j+1}, s_{j+1}]$. The new process has an order
$k_{j+1} \le s_{j+1} + 3$. By induction, we have defined all the functions $f^{(j)}$ for
$0 \le j < \infty$. To complete the definition of $f$ simply put $f = \lim f^{(j)}$. By the
construction this is certainly well defined.
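The staged definition can be sketched in a few lines. This is only an illustration: the values in `s_values` are hypothetical stand-ins for $s_1, s_2, s_3$ (in the proof they are dictated by the given estimators and stopping times), and the helper names `stage_function` and `order_bound` are introduced here.

```python
def stage_function(s_values, j):
    """Return f^(j): zero at states 0, 1 and at s_m + 1 for m = 1..j, and one elsewhere."""
    zeros = {0, 1} | {s + 1 for s in s_values[:j]}
    return lambda i: 0 if i in zeros else 1

def order_bound(s_values, j):
    """Bound on the Markov order of the stage-j process: 3 at stage 0, s_j + 3 afterwards."""
    return 3 if j == 0 else s_values[j - 1] + 3

# Hypothetical s_1, s_2, s_3, spaced so that some t_{j+1} fits in s_j + 3 < t_{j+1} < s_{j+1}.
s_values = [6, 15, 40]

f0 = stage_function(s_values, 0)
f1 = stage_function(s_values, 1)
f2 = stage_function(s_values, 2)
```

Each stage changes the previous function at a single new state, so for every fixed state the values $f^{(j)}(i)$ are eventually constant in $j$, which is why the limit is well defined.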
By the Borel-Cantelli Lemma, conditioned on the positive probability
event $M_0 M_1 M_2 = 012$, the events $A_j$ occur infinitely often almost surely and
this completes the proof of Theorem 7.
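Spelled out, the Borel-Cantelli step reads as follows. This assumes, as in the construction, that the stage-$j$ event has conditional probability greater than $1 - 2^{-j}$; the shorthand $P_{012}$ for the probability conditioned on $M_0 M_1 M_2 = 012$ is introduced here only for this display.

```latex
\sum_{j=1}^{\infty} P_{012}\bigl(A_j^{c}\bigr) \;<\; \sum_{j=1}^{\infty} 2^{-j} \;=\; 1 \;<\; \infty
\quad\Longrightarrow\quad
P_{012}\bigl(A_j^{c} \text{ occurs for infinitely many } j\bigr) = 0 .
```

Hence, conditionally on $M_0 M_1 M_2 = 012$, almost surely all but finitely many of the events $A_j$ occur, which in particular gives infinitely many of them.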

Remark 1 In the final process $X_n$ that we constructed, $P(K(X_{-\infty}^0) = k)$
decays to zero exponentially fast and in particular is summable. It follows
that with probability one eventually $K(X_{-\infty}^n) < n$, so that the reason for our
failure to estimate the memory length correctly is not that
we do not even see the memory word.

It is also worth pointing out the sequence of moments on which the es- 
timator is failing is of density zero. It follows fairly easily from the ergodic 
theorem that if one is willing to tolerate such failures then a straightforward 
application of any backward estimation scheme will converge outside a set of 
density zero. The effort that we expended in §3, 4 to achieve density $1 - \epsilon$ for
the stopping times was because eventually we wanted to guarantee that there would
be no failures at all.

7 Forward Estimation of the Conditional Probability for Finitarily Markovian Processes

Let the alphabet be finite or countably infinite. Now our goal is to estimate
the conditional probability $P(X_{n+1} = x \mid X_0^n)$ on stopping times in a pointwise
sense.

Let $\mathcal{N}$ be a sequence of stopping times such that eventually almost surely
$X_{n-K(X_0^n)+1}^{n}$ appears at least $n^{1-\gamma}$ times in $X_0^n$.

Let $p_n$ be any estimate of the length of the memory from samples $X_0^n$ such
that $p_n - K(X_0^n) \to 0$ on $\mathcal{N}$.

Define our estimate $q_n(x)$ of the conditional probability $P(X_{n+1} = x \mid X_0^n)$ on
$\mathcal{N}$ as

$q_n(x) = \frac{\#\{p_n - 1 \le i < n : X_{i-p_n+1}^{i} = X_{n-p_n+1}^{n},\ X_{i+1} = x\}}{\#\{p_n - 1 \le i < n : X_{i-p_n+1}^{i} = X_{n-p_n+1}^{n}\}}$
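The estimate is a ratio of two counts: how often the last $p_n$ observed symbols occurred earlier in the sample, and how often such an occurrence was followed by $x$. A minimal sketch follows, assuming the sample is held in a Python list `xs` containing $X_0, \ldots, X_n$ and that the memory estimate `p` satisfies $1 \le p \le n + 1$. The function name and the convention of returning `None` for an empty denominator are choices made here.

```python
def q_n(xs, p, x):
    """Empirical conditional frequency of x after earlier occurrences of the last p symbols.

    xs holds X_0, ..., X_n and p is the estimated memory length (assumed 1 <= p <= n + 1).
    Returns None when the final block has no earlier occurrence (empty denominator).
    """
    n = len(xs) - 1
    target = xs[n - p + 1 : n + 1]          # the block X_{n-p+1}^n
    numerator = denominator = 0
    for i in range(p - 1, n):               # p - 1 <= i < n
        if xs[i - p + 1 : i + 1] == target:  # block ending at i matches the final block
            denominator += 1
            if xs[i + 1] == x:
                numerator += 1
    return numerator / denominator if denominator else None
```

For the alternating sample `[0, 1, 0, 1, 0, 1]` with `p = 1`, the final symbol 1 occurs earlier at positions 1 and 3 and is followed by 0 both times, so `q_n(xs, 1, 0)` equals 1.0.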

Theorem 8 On $n \in \mathcal{N}$,

$|q_n(x) - P(X_{n+1} = x \mid X_0^n)| \to 0$ almost surely.

Corollary 1 For the stopping times M and estimator pn in Theorem^ The- 
oremlE holds and the density of Af is at least 1 — e. 
To prove the above theorem we define Markov estimators of the conditional
probabihties. Use A~j^(^„ defined in ([2]). Define the Markov estimator 
using j samples as 

<lii^) = ]tMx^_,. ^,^.}. (17) 

Lemma 7 Almost surely,

$\max_{j \ge \lfloor n^{1-\gamma} \rfloor} |q_n^j(x) - P(X_{n+1} = x \mid X_{-\infty}^n)| \to 0$.

Proof: Since by Lemma [H is an average of independent and iden- 

tically distributed bounded random variables so one may apply Hoeffding's 
inequality (cf. Hoeffding [9J or Theorem 8.1 of Devroye et. al. [7J): 

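$\sum_{j=\lfloor n^{1-\gamma} \rfloor}^{\infty} P(|q_n^j(x) - P(X_{n+1} = x \mid X_{-\infty}^n)| > \epsilon) \le \sum_{j=\lfloor n^{1-\gamma} \rfloor}^{\infty} 2e^{-2\epsilon^2 j}$.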

The right hand side is summable in $n$ and the Borel-Cantelli lemma yields
Lemma 7. The proof of Lemma 7 is complete.
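For completeness, the summability can be read off from the closed form of a geometric tail. The display below takes the Hoeffding bound $2e^{-2\epsilon^2 j}$ for an average of $j$ independent terms with values in $[0, 1]$, writes $m = \lfloor n^{1-\gamma} \rfloor$ and $c = 2\epsilon^2$ (shorthand introduced here only for this computation), and assumes $0 < \gamma < 1$.

```latex
\sum_{j=m}^{\infty} 2e^{-cj} \;=\; \frac{2e^{-cm}}{1-e^{-c}},
\qquad
\sum_{n=1}^{\infty} \frac{2e^{-c\lfloor n^{1-\gamma}\rfloor}}{1-e^{-c}} \;<\; \infty .
```

The second sum is finite because $e^{-c\lfloor n^{1-\gamma}\rfloor}$ decays faster than any power of $n$ when $1 - \gamma > 0$.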

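Proof of Theorem 8: Since eventually, for $n \in \mathcal{N}$, $X_{n-K(X_0^n)+1}^{n}$ appears
at least $n^{1-\gamma}$ times in $X_0^n$, we have $q_n(x) = q_n^j(x)$ for some $j \ge \lfloor n^{1-\gamma} \rfloor$. Since
$P(X_{n+1} = x \mid X_0^n) = P(X_{n+1} = x \mid X_{-\infty}^n)$, the result follows from Lemma 7. The proof of

Theorem 8 is complete.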

8 Forward Estimation of the Conditional Probability for Markov Processes

Let $\{X_n\}$ be a stationary and ergodic finite or countably infinite alphabet
Markov chain with order $K$. Let ORDEST$_n$ be an estimator of the order
from samples $X_0^n$ such that ORDEST$_n \to K$ almost surely. Such an estimator
can be found e.g. in Morvai and Weiss [16]. Let $n \in \mathcal{N}$ if $X_{n-\mathrm{ORDEST}_n+1}^{n}$
appears at least $n^{1-\gamma}$ times in $X_0^n$. $\mathcal{N}$ is a sequence of stopping times. Let

$q_n(x) = \frac{\#\{\mathrm{ORDEST}_n - 1 \le i < n : X_{i-\mathrm{ORDEST}_n+1}^{i} = X_{n-\mathrm{ORDEST}_n+1}^{n},\ X_{i+1} = x\}}{\#\{\mathrm{ORDEST}_n - 1 \le i < n : X_{i-\mathrm{ORDEST}_n+1}^{i} = X_{n-\mathrm{ORDEST}_n+1}^{n}\}}$
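Membership of $n$ in the set of stopping times depends only on how often the last block of the sample has occurred. A minimal sketch follows, assuming the sample is a Python list `xs` holding $X_0, \ldots, X_n$, that `k` plays the role of the order estimate, and that the final occurrence of the block is itself counted (a convention chosen here; the text does not spell it out). The function names are illustrative.

```python
def count_block(xs, k):
    """Count positions i in k-1..n at which the block X_{i-k+1}^i equals the final block X_{n-k+1}^n."""
    n = len(xs) - 1
    target = xs[n - k + 1 : n + 1]
    return sum(1 for i in range(k - 1, n + 1) if xs[i - k + 1 : i + 1] == target)

def in_stopping_set(xs, k, gamma):
    """True when the final k-block has appeared at least n^(1 - gamma) times in X_0, ..., X_n."""
    n = len(xs) - 1
    return count_block(xs, k) >= n ** (1 - gamma)
```

For the alternating sample `[0, 1] * 8` we have n = 15 and the final symbol occurs 8 times, so the test passes for `gamma = 0.5` (threshold about 3.9) and fails for `gamma = 0.1` (threshold about 11.4).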
Theorem 9 Assume ORDEST$_n$ equals the order eventually almost surely. Then
on $n \in \mathcal{N}$,

$|q_n(x) - P(X_{n+1} = x \mid X_{n-K+1}^{n})| \to 0$ almost surely.

and 

n — ^oo fi 

If the Markov chain turns out to take values from a finite set, then $\mathcal{N}$ takes
as values all but finitely many positive integers.

To prove the above theorem we define Markov estimators of the conditional 
probabihties. Use XnXi defined in ([2]). Define the Markov estimator using j 
samples as 

?^(^) = 7El{x_.- (18) 



7 n-\ .+1 

J j = l n,K,i 



Lemma 8 Almost surely,

$\max_{j \ge \lfloor n^{1-\gamma} \rfloor} |q_n^j(x) - P(X_{n+1} = x \mid X_{n-K+1}^{n})| \to 0$.

Proof: The proof goes along the lines of the proof of Lemma 7. The proof
of Lemma 8 is complete.

Proof of Theorem 9: Since ORDEST$_n = K$ eventually, for $n \in \mathcal{N}$ the block
$X_{n-K+1}^{n}$ appears at least $n^{1-\gamma}$ times, thus $q_n(x) = q_n^j(x)$ for some $j \ge \lfloor n^{1-\gamma} \rfloor$
and the result follows from Lemma 8. Since any word of length $K$ with
positive probability appears eventually almost surely $n^{1-\gamma}$ times in $X_0^n$,
$\mathcal{N}$ has density one. If the alphabet is finite, then the number of words with
length $K$ is finite and by ergodicity, eventually almost surely all words with
length $K$ which have positive probability appear at least $n^{1-\gamma}$ times. The
proof of Theorem 9 is complete.